Extracting text lines from a text document image, segmenting them into isolated words, and
segmenting those words into individual characters are among the most critical processes in
developing optical character recognition (OCR) systems and in turning a document into a
searchable electronic representation. This paper presents a new approach to analyzing Arabic
text documents. The proposed approach contains four steps: preprocessing, text line
segmentation, word segmentation, and character segmentation. The horizontal projection method
is used to detect and extract the text lines from the preprocessed text document image. In the
word segmentation step, a space threshold is computed to classify the spaces among connected
components in a text line as within-word or between-word spaces, so that the text line can be
segmented into isolated words. Finally, a thinning method is applied to find the skeleton of
each segmented word, and the geometric characteristics of the characters are analyzed to detect
ligatures and characters. The proposed approach was tested and evaluated on a set of 115 text
images, containing images from the King Fahd University of Petroleum and Minerals (KFUPM)
handwritten Arabic text (KHATT) database and some images produced by the authors. The
experimental results are encouraging, with success rates of 98.6% for line segmentation, 96%
for word segmentation, and 87.1% for character segmentation.

1. INTRODUCTION


Text-line, word, and character segmentation is the technique by which the fundamental elements in
a text document image are localized and extracted. Segmentation is a critical stage in handwritten and
printed text recognition, and it is regarded as the most important and most challenging phase of online and
offline optical character recognition (OCR) [1], [2]. In order to recognize words or characters, an OCR
system must first break the text into lines, then segment the lines into words, and then the words into
characters [3], [4]. A poor segmentation method causes misrecognition or rejection [5].

Segmentation of text lines includes both detection and extraction of text lines. Detection commonly
locates text line patterns, while extraction assigns pixels to text lines with precision. A text line consists
of a sequence of words, and words are usually made up of a number of sub-words (characters, connected
components) that are separated by spaces. In Arabic handwriting, spaces are divided into two types:
within-word space, which is the space between sub-words of the same word, and between-word space,
which is the gap between two consecutive words.

The spaces in Arabic handwriting do not adhere to any rules: each writer has a unique writing style
and therefore a unique way of leaving gaps between words. Word extraction consists of identifying the
between-word spaces. When there is not enough space between words in Arabic handwriting, separation
becomes difficult. The majority of the approaches proposed in the literature for word extraction compute a
threshold to distinguish the gaps between words (between-word space) from the gaps between the connected
components of the same word (within-word space) [6].

Detecting text lines, words, and characters in Arabic documents remains a challenge. Arabic
documents are regarded as more complex than manuscripts written in other languages. This complexity
arises first from the features of handwritten text, which may vary in writing style, size, orientation, and
alignment, and in which consecutive text lines might touch or overlap. It arises second from the nature of
Arabic writing: the cursiveness of the text, intersecting characters, diacritics, the diversity of calligraphy,
the frequent splitting of words into letters and sub-words, and the varying spaces between them.


2. RELATED WORK 

Segmentation of offline characters is a crucial step before feature extraction and, therefore, character
recognition. A variety of text-line and word extraction techniques have been proposed in the literature. For
text-line segmentation, Kumar et al. [7] developed a graph-based approach for extracting handwritten text
lines; the approach is highly resistant to variations in font size and to non-uniform asymmetry.
Shi et al. [8] proposed a technique based on a directional filter and a generalized adaptive local
connectivity map. It works well for varying, touching, or crossing text lines.

Barakat et al. [9] suggested an unsupervised method for extracting text lines, motivated by the
relative difference between text lines and the spaces between them: the number of foreground pixels over
text lines differs significantly from the number over the gaps between text lines. The technique uses a
Siamese convolutional network to predict whether two given document image patches are similar or
different, based on the number of foreground pixels in the patches. Alghamdi et al. [10] used horizontal
projection to segment historical document images into text lines. This approach reduces the image from two
dimensions to one by summing the pixels of each row. The rows with the minimum pixel counts are then
used to crop the lines in the historical document.

Arvanitopoulos and Susstrunk [11] used seam carving, which is a top-down method. They first use
a projection profile matching method to compute medial seams on the text lines. Then, using a modified
version of the seam carving method, they compute the separating seams. On the Arabic dataset they used,
their approach yielded positive results (99.9%). Ouwayed and Belaïd [12] used morphological analysis to
determine the last letters of Arabic words; the suggested method was tested on overlapping texts and found
to be highly efficient.

For word extraction, most of the approaches in the literature employ some measure of the distance
between successive connected components and establish a threshold for classifying the gaps as gaps
between words (between-word space) or gaps between connected components of the same word (within-word
space) [13]. To establish an appropriate threshold, the approach in [14] is based on the magnitude of the
gaps; the method classifies the gaps among connected components into three sets.

For character extraction, many techniques are used in the literature. Explicit segmentation methods
segment a word image into a number of small components, whereas implicit segmentation methods combine
the segmentation and recognition stages by segmenting words into characters and recognizing them at the
same time. Based on the techniques used, character segmentation algorithms can be categorized into
projection profile-based methods, character skeleton-based methods, contour tracing-based methods,
template matching-based methods, neural network (NN) based methods, hidden Markov model (HMM) based
methods, line adjacency graph-based methods, morphological operations-based methods, and
recognition-based segmentation methods [15].

Projection profile-based methods rely on the fact that the connecting stroke between successive
letters is thinner than the letters themselves. The horizontal projection is used to separate lines and identify
the text baseline, while the vertical projection is used to segment words, sub-words, and characters. Both
projections are used in Alghamdi et al. [10]. First, pre-processing is done to clean the historical document
image. The method then uses horizontal projection to segment the historical document image into text lines
and vertical projection to segment the text lines into characters, after which an erosion followed by a
dilation is applied to each text line. Finally, the density is calculated for all columns and the characters
are segmented. The method of Anwar et al. [16] looks for possible segmentation locations in a segmented
word image based on the fact that the connecting stroke between successive letters is thinner than the
letters themselves. The character skeleton is another technique used to segment a word into characters; for
character recognition, the skeleton of a shape contains all of the necessary information.

In general, a number of approaches to skeleton extraction have been reported in the literature.
Among those proposed specifically for Arabic, [17] presented a novel skeletonization technique for isolated
Arabic letters based on clustering the character image. The skeleton was then created by locating the
adjacency matrix of the various clusters, and the authors finished by eliminating unimportant vertices from
the skeleton. To cluster the Arabic letter, Altuwaijri and Bayoumi [18] used a self-organizing neural
network. The skeleton was created by plotting the cluster centers and linking neighboring clusters with a
sequence of straight lines. Cowell and Hussain [19] used an iterative thinning method with post-processing
to create thin shapes of segmented Arabic letters. They also discussed the issues of thinning Arabic letters
in images of poor quality.

Contour tracing is used in Osman [20]. The proposed algorithm divides the acquired image into
text lines and connected components (sub-words). Then, the contour of every connected component is traced,
and points are extracted where the contour changes from a vertical line to a horizontal line or vice versa.
The coordinates of these points are then used as the separation points. The method of Wshah et al. [21]
depends on the idea that every connected letter inside a sub-word has junction points in the letter skeleton.
This method analyzes the skeleton of a sub-word image to determine the junction points, and the contour of
the image is analyzed to determine the shortest path to the extracted junction points. Finally, the
segmentation points are the first three lower peaks of the distance map between the intersection points and
the chain code. Mohammad et al. [22] also use contour segmentation. The proposed method is composed of
four phases, and its input is the binary image of a word or sub-word. The first step extracts the contour of
the connected component of the word or sub-word. From the extracted contour, the start and end points are
determined by contour tracing, giving the up-contour. The splitting points are then identified from the
extracted up-contour using pixel values. Finally, the post-processing phase examines each portion to decide
whether it needs to be combined with another portion or is a separate character. In Omidyeganeh et al. [23],
the method collects information on the word's general shape by identifying the contour of the word, which
consists of the pixels that make up the word's outer shape. To find potential segmentation points, it uses a
representation of the word shape (contour) based on the fact that each letter has a high contour followed by
a flat or low contour, with the segmentation points identified before the contour starts rising.

A neural network is used in Radwan et al. [24]. The model consists of a multi-channel NN whose
input layers receive three windows: a sliding window as the middle channel, and the next and previous
windows for further context. The model predicts whether the current window is likely to be a segmentation
area. The left channel is responsible for learning the right parts of characters, the right channel is
responsible for learning the left parts of characters, and the middle channel is responsible for learning the
region in between. An output layer learns the relationships among the channels' individual properties in
order to determine the correlation among them.

A graph approach is used in Elgammal et al. [25] for Arabic character segmentation. The suggested
method is based on the morphological relationship between the baseline and the line adjacency graph
(LAG) representation of the text. A hidden Markov model (HMM) method is used in [26]. It contains two
modules: the trainer module is responsible for preparing and training the HMM and is trained on printed
Arabic text, and the isolation module is responsible for segmenting input images into letters.

A connected component-based method is used in Alirezaee et al. [27], in which the text lines are
segmented into connected blocks (words, sub-words, or characters). Before segmentation of the text
document image, preprocessing is applied to clean the document. Then, morphological connected block
extraction is used to extract the line blocks using a specific formula. Finally, post-processing applies
specific conditions and formulas to extract the text elements. This technique was also used together with a
clustering algorithm in Ouwayed and Belaïd [12]. First, the image is converted into a black-and-white
image, and this binary image is then segmented into connected blocks. Points at the center of each
connected block are used as input to the clustering method. The points are then grouped using the k-means
method, and finally the text lines are generated.

A recognition-based character segmentation technique is used in Inkeaw et al. [28]. In this
technique, the segmentation and recognition stages are carried out at the same time. The technique searches
word segments for components that fit specified classes in its alphabet and divides them into their letters
without breaking them into smaller units. This approach is also used in Elnagar and Bentrcia [29]. First,
pre-processing is done to clean the image and extract its features. The pre-processed text image is then
segmented into text lines, and the text lines are segmented into words. Thinning is applied to each
segmented word, and the main connected block is obtained in every thinned word. Three kinds of region
features are extracted from these main connected blocks, and seven factors are used to determine the start
and end of the cutting points. Finally, features are extracted from the segmented regions and fed to an
artificial neural network, which classifies each region as a stroke or a character.


3. PROPOSED APPROACH 

In this work, we propose an approach for segmenting Arabic text document images into lines,
words, and characters. The input to the system is a text document image (printed or handwritten). The
proposed approach involves four main steps: preprocessing, line segmentation, word segmentation, and
character segmentation. The preprocessing step cleans up the text document image for the subsequent
segmentation steps. The line segmentation step isolates the text lines of the text document image. The
word segmentation step segments the extracted text lines into word images. We compute a space
threshold (T) for each extracted text line. T is compared with the distances between connected components,
and connected components separated by a between-word space are isolated. The last step divides those
words into individual characters. To recognize ligatures and characters, the geometric characteristics of
the characters are used.


3.1. Preprocessing 

The preprocessing stage prepares the text document images before segmentation in order to
simplify the segmentation steps, and its output is used to improve the performance of the segmentation
procedure. Figure 1 depicts the results of the Arabic text image preprocessing step. The preprocessing
step includes binarization, slant correction, and cropping:

- Binarization separates the text from the image background, whether the original image is in color or
grayscale, resulting in a two-tone image with a black background and white text or the reverse [30].
Figure 1(a) shows the original text image and Figure 1(b) depicts the result of the binarization step.

- Slant correction, also known as skew correction, adjusts the inclination angle of a text image [31].
Figure 1(c) depicts the result of the slant correction step.

- Cropping eliminates the excess space surrounding the rectangular region that contains the noise-free
text. Figure 1(d) depicts the result of the cropping step.

Figure 1. Arabic text image pre-processing: (a) original text image, (b) text image after binarization, (c) text
image after slant correction, and (d) text image after cropping
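
As a rough illustration of the binarization and cropping steps, the sketch below applies a fixed global
threshold and a bounding-box crop to a grayscale image stored as a list of rows. The paper does not state
which binarization method is used, so the threshold of 128, the function names `binarize` and
`crop_to_text`, and the convention that 1 marks a text pixel are assumptions made for this example only.
Slant correction is omitted.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0/1, with 1 = dark text pixel.

    The fixed threshold is an assumption; the paper does not name its method.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_to_text(binary):
    """Crop a 0/1 image to the bounding box of its foreground (1) pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []  # blank page: nothing to crop
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in binary[top:bottom + 1]]


gray = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255, 255,  30, 255],
    [255, 255, 255, 255],
]
cropped = crop_to_text(binarize(gray))
print(cropped)
```

A global threshold is the simplest choice; unevenly lit scans would probably need an adaptive method.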


3.2. Text line segmentation 

In order to segment the text lines, we used the method presented in Lamsaf et al. [32], which is
based on horizontal projection. The method first calculates the histogram of the horizontal projection of the
text document image and then assigns the value 0 to rows whose sums are less than or equal to a threshold
of 12. Second, the method recalculates the histogram of the horizontal projection and computes the
difference between each two successive components of the resulting vector. Third, it finds the local maxima
and minima of the previous step's result, assigns the value 1 to the rows neighbouring the local maxima and
minima, and finally separates the text lines. The text line segmentation results are shown in Figure 2.

Figure 2. Result of text line segmentation
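
The core idea of horizontal projection can be sketched as follows. This is a simplified illustration and
not the full method of [32]: it keeps only the first step (row sums compared with the threshold of 12) and
then groups consecutive rows above the threshold into lines, leaving out the difference and local-extrema
steps. The function names and the list-of-rows image representation are assumptions.

```python
def horizontal_projection(binary):
    """Number of foreground (1) pixels in each row of a 0/1 image."""
    return [sum(row) for row in binary]


def segment_lines(binary, threshold=12):
    """Return inclusive (top, bottom) row ranges of runs of rows whose sum exceeds threshold."""
    profile = horizontal_projection(binary)
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > threshold and start is None:
            start = r                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, r - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # text line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


width = 20
page = (
    [[0] * width] * 2      # top margin
    + [[1] * width] * 3    # first text line (rows 2-4)
    + [[0] * width] * 2    # gap between lines
    + [[1] * width] * 4    # second text line (rows 7-10)
    + [[0] * width]        # bottom margin
)
line_ranges = segment_lines(page)
print(line_ranges)
```

Grouping rows this way only works when lines do not touch or overlap; the extra steps in [32] are
presumably there to handle less clean profiles.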


3.3. Word segmentation 

Spaces in Arabic writing are of two types: the spacing between sub-words of a single word is
known as within-word space, and the spacing between two words is known as between-word space.
Detecting between-word spaces is the first step in word segmentation. Let
$CSW = (CSW_1, CSW_2, \ldots, CSW_n)$ be the spaces in a text line. Given the space threshold $T$, we
classify a space $CSW_i$ as a word space when its width, that is, the number of consecutive columns
containing zero foreground pixels, is greater than $T$. The space threshold is the average of all gaps
among connected components in the text line. All word spaces are recorded as
$SW = (SW_1, SW_2, \ldots, SW_m)$. The word segmentation process is shown in Algorithm 1, and the word
segmentation results are shown in Figure 3.


Algorithm 1. Word segmentation
Input: TLI as a text line image.
ListOfCSW ← ListOfCountOfForegroundColorInEachColumn where Count = 0.
CSW ← MergeConsecutive(ListOfCSW).
T ← ComputeSpaceThreshold.
For each segment in CSW:
    if CSW > T:
        SWI ← Segment(CSW).
End for.
Output: SWI as segmented word images.

Figure 3. Result of segmenting a text line into words
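
A minimal sketch of Algorithm 1 is given below, assuming the text line is a 0/1 image stored as a list of
rows with 1 marking foreground pixels. The helper names `column_counts`, `find_gaps`, and `segment_words`
are hypothetical, and ignoring the left and right margins when averaging the gaps is an assumption that the
text does not spell out.

```python
def column_counts(binary):
    """Number of foreground (1) pixels in each column of a 0/1 text line image."""
    return [sum(row[c] for row in binary) for c in range(len(binary[0]))]


def find_gaps(counts):
    """Return inclusive (start, end) column ranges of empty runs lying between components."""
    gaps, start = [], None
    for c, count in enumerate(counts):
        if count == 0 and start is None:
            start = c
        elif count != 0 and start is not None:
            gaps.append((start, c - 1))
            start = None
    # A trailing margin is never appended; a leading margin starts at column 0 and is dropped.
    return [g for g in gaps if g[0] > 0]


def segment_words(binary):
    """Split a text line into inclusive (start, end) column ranges of words."""
    counts = column_counts(binary)
    ink = [c for c, n in enumerate(counts) if n > 0]
    if not ink:
        return []
    gaps = find_gaps(counts)
    if not gaps:
        return [(ink[0], ink[-1])]
    widths = [end - start + 1 for start, end in gaps]
    threshold = sum(widths) / len(widths)      # T: average gap width in this line
    words, word_start = [], ink[0]
    for (start, end), width in zip(gaps, widths):
        if width > threshold:                  # between-word space
            words.append((word_start, start - 1))
            word_start = end + 1
    words.append((word_start, ink[-1]))
    return words


line = [[1, 1, 0, 1, 0, 0, 0, 1, 1]]           # gaps of width 1 and 3, so T = 2
words = segment_words(line)
print(words)
```

Because the comparison is strict, a line whose gaps all have the same width is returned as a single word;
this follows directly from using the average as the threshold.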


3.4. Character segmentation 

Arabic characters fall into three categories. The first category contains characters with a closed
loop, such as HAA, TAH, SAD, DAD, and QAF. The second category contains characters with a semi-loop,
such as LAM, TAA, BAA, NOON, and AIN. The third category contains characters that resemble ligatures,
such as RAA and ALIF MAQSURA.

In open characters, distinguishing between ligatures and character segments is difficult. A ligature
is a connection between two or more consecutive letters. In written Arabic words, successive TAA, NOON,
and THAA may look like SHEEN or SEEN and vice versa. Successive FAA and HAMZA may appear as DHAD.
The proposed algorithm (Algorithm 2) analyses the geometric characteristics of the characters to detect
ligatures and characters.

Algorithm 2. Character segmentation
Input: SWI as segmented word images.
OIMG ← CopyFromImage(SWI).
WSL ← SkeletonOfWord(SWI).
ODAD ← OmitDotsAndDiacritics(WSL).
ListOfCSC ← ListOfCountOfForegroundColorInEachColumn where Count = 0 or 1.
CSC ← MergeConsecutive(ListOfCSC).
LEP ← LastEndPointFromLeft.
Set CSC_n ← LEP where the distance between LEP and CSC_n is less than d, d = 4, n is the last CSC.
ListOfSegmentedCharacter ← Segment(CSC, OIMG).
For each segment in ListOfSegmentedCharacter:
    if structural similarity of merged two consecutive segments S_i, S_(i+1) ≈ structural similarity of SAD or DAD forms
        then merge S_i, S_(i+1).
    if structural similarity of merged three consecutive segments S_i, S_(i+1), and S_(i+2) ≈ structural similarity of SEEN or SHEEN forms
        then merge S_i, S_(i+1), and S_(i+2).
Output: segmented character images.

3.5. Description of the proposed algorithm

Before analyzing the geometric characteristics of the characters, we apply a thinning algorithm to
the segmented word images to reduce their stroke width to 1 pixel, as shown in Figure 4. The thinning
technique creates the image's skeleton: a skeleton has a width of one pixel and follows the centerline of
the word. The presence of dots and diacritics above or below some open characters causes
over-segmentation errors. We use the connected component algorithm to omit the dots and diacritics from
the image: we first detect all connected components and then delete all small connected components.
Figure 5 shows the output after omitting the dots and diacritics from the word image.
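
The removal of dots and diacritics can be sketched with a plain breadth-first connected component search,
as below. The paper only says that "small" components are deleted, so the `min_size` parameter, the use of
8-connectivity, and the function name `remove_small_components` are assumptions for illustration.

```python
from collections import deque


def remove_small_components(binary, min_size):
    """Return a copy of a 0/1 image with 8-connected components smaller than min_size erased."""
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects the pixels of one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) < min_size:
                for y, x in component:
                    out[y][x] = 0          # erase a dot or diacritic
    return out


word = [
    [0, 0, 1, 0, 0, 0],   # isolated dot above the stroke
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # main stroke
]
cleaned = remove_small_components(word, min_size=3)
print(cleaned)
```

In practice the size limit would have to be tuned to the resolution and stroke width of the word images.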

After omitting the dots and diacritics, the word image is scanned from top right to bottom left and
the number of foreground pixels is computed for each column. The columns containing 0 or 1 foreground
pixels are termed the list of candidate segmentation columns (LCSC); Figure 6 shows the LCSC of the word
image. As shown in Figure 7, consecutive LCSC are merged and termed candidate segmentation columns (CSC).
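
The LCSC and CSC computation can be illustrated as follows, assuming the skeleton is a 0/1 image stored as
a list of rows. The function names are hypothetical, and columns are indexed from the left here for
simplicity even though the paper scans from the right.

```python
def candidate_columns(skeleton):
    """Columns of a 0/1 skeleton image that contain 0 or 1 foreground pixels (the LCSC)."""
    width = len(skeleton[0])
    return [c for c in range(width) if sum(row[c] for row in skeleton) <= 1]


def merge_consecutive(columns):
    """Merge runs of adjacent column indices into inclusive (start, end) ranges (the CSC)."""
    runs = []
    for c in columns:
        if runs and c == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], c)    # extend the current run
        else:
            runs.append((c, c))            # start a new run
    return runs


skeleton = [
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],   # baseline stroke joining two vertical strokes
]
lcsc = candidate_columns(skeleton)
csc = merge_consecutive(lcsc)
print(lcsc, csc)
```

Columns 1 and 5 hold three pixels each and are therefore excluded, which leaves the thin connecting
stroke between them as one merged candidate region.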


Figure 4. Skeleton of word

Figure 5. Word image after omitting dots and diacritics

Figure 6. Word image with LCSC

Figure 7. Word image with CSC


Over-segmentation occurs when an open character comes at the end of a word. To solve this
over-segmentation, we detect the last end point (LEP) and subtract the x-axis value of LEP from the x-axis
value of CSCn; if the result R satisfies R<D, we replace CSCn by LEP, as shown in Figure 8 and Figure 9.
The factor D (d in Algorithm 2) bounds the distance between LEP and CSCn, and we use an experimentally
chosen value: D=4.
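
The replacement rule can be written down directly, as in the sketch below. Representing each CSC by a
single x-coordinate and ordering the list so that its last entry is CSCn are assumptions, as is the
function name `adjust_last_csc`.

```python
def adjust_last_csc(csc_x, lep_x, d=4):
    """Return a copy of csc_x whose last entry is replaced by lep_x when R < d.

    R is the x-coordinate of the last CSC minus the x-coordinate of the LEP,
    following the description in the text; d = 4 is the value reported there.
    """
    if not csc_x:
        return []
    r = csc_x[-1] - lep_x
    adjusted = list(csc_x)
    if r < d:
        adjusted[-1] = lep_x
    return adjusted


near = adjust_last_csc([40, 25, 9], lep_x=7)     # R = 2 < 4, so CSCn becomes the LEP
far = adjust_last_csc([40, 25, 15], lep_x=7)     # R = 8, so CSCn is kept
print(near, far)
```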


Two other over-segmentation cases appear, as shown in Figure 10: the first occurs in the SAD and
DAD letters, as shown in Figure 10(a), and the second occurs in the SEEN and SHEEN letters, as shown in
Figure 10(b). Step 10 aims to solve these over-segmentations; the result of this step is shown in
Figure 11.


Figure 8. Word after detecting last end point and CSC

Figure 9. Segmentation after applying step 8


An approach to analysis of arabic text documents into text lines, words, and characters (Hakim A. Abdo) 


ISSN: 2502-4752


Figure 10. Over-segmentation in (a) SAD and DAD and (b) SEEN and SHEEN

In step 10, to solve the over-segmentation that occurs in SAD and DAD, we compare the structural
similarity of two merged consecutive segmented characters (Si and Si+1) with that of an image of SAD or
DAD; if it matches, we merge Si and Si+1 into one segmented character, as shown in Figure 11(a). We use the
ssim function to compute the structural similarity index between two images. Likewise, we compare the
structural similarity of three merged consecutive segmented characters (Si, Si+1, and Si+2) with that of an
image of SEEN or SHEEN; if it matches, we merge Si, Si+1, and Si+2 into one segmented character, as shown
in Figure 11(b). Figure 12 depicts the result of the structural similarity index comparison.
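
To make the comparison concrete, the sketch below computes a single-window structural similarity index
between two equal-sized grayscale images. It is a simplification: library ssim functions usually average
the index over local windows, and the merged segments would first have to be resized to the size of the
reference image. The constants 0.01 and 0.03 are the customary SSIM stabilizers; the function names and the
`threshold` argument of `should_merge` are hypothetical, since the paper does not report the value at which
two images are considered to match.

```python
def global_ssim(img_a, img_b, dynamic_range=255):
    """Single-window SSIM of two equal-sized grayscale images given as lists of rows."""
    a = [px for row in img_a for px in row]
    b = [px for row in img_b for px in row]
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2       # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2       # stabilizes the contrast/structure term
    return ((2 * mean_a * mean_b + c1) * (2 * cov + c2)) / (
        (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2))


def should_merge(merged, reference, threshold):
    """Decide whether merged segments match a reference character image (threshold is hypothetical)."""
    return global_ssim(merged, reference) >= threshold


reference = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
same_score = global_ssim(reference, reference)
inverted_score = global_ssim(reference, inverted)
print(same_score, inverted_score)
```

An image compared with itself scores 1 (up to rounding), while the inverted image scores below zero, so
any positive threshold separates the two in this toy case.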


Figure 11. Word images after solving over-segmentation by step 10: (a) result of solving
over-segmentation in SAD and (b) result of solving over-segmentation in SEEN

Figure 12. Structural similarity index of the image of three merged consecutive segmented characters and
the SEEN image


4. RESULTS AND DISCUSSION 

To assess the efficacy of the suggested method, we tested it on a set of 115 images of printed and
handwritten Arabic text. This set contains images from the KFUPM handwritten Arabic text (KHATT)
database [33] and some images produced by the author. For every text image, we used the horizontal
projection profile method to segment each text block into its text lines, then used the word segmentation
method to segment each text line into words, and finally used the character segmentation method to segment
all segmented words into individual characters.

We calculated the success rate of the obtained results. The segmentation errors were divided into
three categories: bad-segmentation, over-segmentation, and miss-segmentation. Most of the
bad-segmentation errors that are common to all fonts occurred in the LAMALIF character, and in the
“Decotype Thuluth”, “Traditional Arabic”, and “Advertising Bold” fonts owing to overlapping letters. On
the other hand, most of the miss-segmentation errors appear in certain font types owing to the tiny font
size. At tiny font sizes, the segmentation points may not be visible and therefore cannot be identified,
since the space between sub-parts is extremely small. It is difficult to compare the findings of the suggested
segmentation approach with those of other segmentation approaches published in the literature, since
researchers have presented their segmentation results under different constraints and used different types
of databases. Table 1, Table 2, and Table 3 present the results of text line segmentation, word
segmentation, and character segmentation, respectively. Figure 13 shows an example of wrong character
segmentation, and Figure 14 shows samples of the segmentation results.

Figure 13. Example of wrong character segmentation

Figure 14. Samples of segmentation results


Table 1. Text line segmentation results

Text images used     Text lines    Correctly segmented        Incorrectly segmented
in the experiment    in images     text lines (percentage)    text lines (percentage)
115                  575           567 (98.6%)                8 (1.4%)


Table 2. Word segmentation results

Words in the text    Correctly segmented    Incorrectly segmented
lines used           words (percentage)     words (percentage)
5175                 4968 (96%)             207 (4%)


Table 3. Character segmentation results

Characters in    Correctly segmented        Incorrectly segmented      Incorrectly segmented characters by error type
word images      characters (percentage)    characters (percentage)    Over-segmented    Miss-segmented    Bad-segmented
31050            27046 (87.1%)              4004 (12.9%)               1135              2378              491
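
The percentages in Tables 1 to 3 follow from the reported counts, as the short check below shows. The
helper name `rate` is hypothetical; the counts are taken from the tables.

```python
def rate(part, total):
    """Percentage of part in total, rounded to one decimal place."""
    return round(100 * part / total, 1)


line_rate = rate(567, 575)          # text lines, Table 1
word_rate = rate(4968, 5175)        # words, Table 2
char_rate = rate(27046, 31050)      # characters, Table 3
error_total = 1135 + 2378 + 491     # over-, miss-, and bad-segmented characters

print(line_rate, word_rate, char_rate, error_total)
```

The three error categories add up to the 4004 incorrectly segmented characters, and miss-segmentation
accounts for more than half of them.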


5. CONCLUSION 

Analyzing text images into text lines, words, and characters is still an active topic of study,
particularly for cursive writing. Character segmentation is considered one of the most important phases in
developing OCR systems owing to font differences (e.g., size, type, and style), the presence of complicated
fonts, and character overlapping. The presented segmentation approach reduced the issue of
over-segmentation, which was evident in open character segmentation, particularly in the characters SEEN,
SHEEN, SAD, and DAD. The proposed approach achieved success rates of 98.6% and 96% when segmenting text
document images into lines and words, respectively. It also performed well in segmenting the ligatures that
occur between successive closed and open letters, with accurate segmentation in the case of printed text
document images without touching characters. Miss-segmentation problems occurred in certain font types,
such as "Decotype Thuluth," "Traditional Arabic," and "Advertising Bold," due to overlapping characters, in
which one letter appears over another, making it impossible for the proposed approach to identify the
ligature between the letters. In future work, some of the pre-processing methods, e.g., noise removal and
slant correction, need to be enhanced. In addition, the character segmentation stage should be enhanced,
and the ssim function used in step 10 could be replaced with machine learning techniques.